Linear Regression and Correlation: Introduction

This module provides an introduction to Linear Regression and Correlation
as a part of the Collaborative Statistics collection (col10522) by Barbara
Illowsky and Susan Dean.


Student Learning Outcomes 
By the end of this chapter, the student should be able to: 


• Discuss basic ideas of linear regression and correlation.

• Create and interpret a line of best fit.

• Calculate and interpret the correlation coefficient.

• Calculate and interpret outliers.


Introduction 


Professionals often want to know how two or more numeric variables are 
related. For example, is there a relationship between the grade on the 
second math exam a student takes and the grade on the final exam? If there 
is a relationship, what is it and how strong is the relationship? 


In another example, your income may be determined by your education, 
your profession, your years of experience, and your ability. The amount you 
pay a repair person for labor is often determined by an initial amount plus 
an hourly fee. These are all examples in which regression can be used. 


The type of data described in the examples is bivariate data - "bi" for two 
variables. In reality, statisticians use multivariate data, meaning many 
variables. 


In this chapter, you will be studying the simplest form of regression, "linear
regression" with one independent variable (x). This involves data that fits a
line in two dimensions. You will also study correlation, which measures how
strong the relationship is.


Linear Regression and Correlation: Linear Equations 

This module provides an overview of Linear Regression and Correlation: 
Linear Equations as a part of Collaborative Statistics collection (col10522) 
by Barbara Illowsky and Susan Dean. 


Linear regression for two variables is based on a linear equation with one
independent variable. It has the form:
Equation:

y = a + bx

where a and b are constant numbers.
x is the independent, or explanatory, variable, and y is the dependent,
or response, variable. Typically, you choose a value to substitute for the
independent variable and then solve for the dependent variable.


Example: 
The following examples are linear equations. 
Equation: 


Equation: 


The graph of a linear equation of the form y = a + bx is a straight line.


Any line that is not vertical can be described by this equation. 


Example: 




Graph of the equation 


Linear equations of this form occur in applications of life sciences, social
sciences, psychology, business, economics, physical sciences, mathematics,
and other areas.


Example: 
Aaron's Word Processing Service (AWPS) does word processing. Its rate is
$32 per hour plus a $31.50 one-time charge. The total cost to a customer
depends on the number of hours it takes to do the word processing job.
Exercise: 


Problem: 


Find the equation that expresses the total cost in terms of the number 
of hours required to finish the word processing job. 


Solution: 


Let x = the number of hours it takes to get the job done.
Let y = the total cost to the customer.


The $31.50 is a fixed cost. If it takes x hours to complete the job, then
(32)(x) is the cost of the word processing only. The total cost is:

y = 31.50 + 32x
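
The fixed charge and the hourly rate combine into the linear equation y = 31.50 + 32x. As a quick check, the short Python sketch below evaluates that equation for a few job lengths. The function name total_cost and the sample hour values are illustrative choices, not part of the original exercise.

```python
# Total cost for Aaron's Word Processing Service: y = 31.50 + 32x
# (total_cost is a hypothetical helper name used only for illustration)

FIXED_CHARGE = 31.50   # one-time charge, in dollars (the y-intercept)
HOURLY_RATE = 32.00    # rate per hour, in dollars (the slope)


def total_cost(hours):
    """Return the total cost y (in dollars) for a job that takes `hours` hours."""
    return FIXED_CHARGE + HOURLY_RATE * hours


# Sample job lengths chosen only for illustration.
for hours in (0, 1, 2.5):
    print(f"{hours} hours -> ${total_cost(hours):.2f}")
```

A job of 0 hours costs only the fixed charge, and each extra hour adds $32 to the total.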


Linear Regression and Correlation: Slope and Y-Intercept of a Linear 
Equation 

This module provides an overview of Linear Regression and Correlation: 
Slope and Y-Intercept of a Linear Equation as a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean. 


For the linear equation y = a + bx, b = slope and a = y-intercept. Note this
is similar to the algebra form y = mx + b, but it is not the same.


From algebra recall that the slope is a number that describes the steepness 
of a line and the y-intercept is the y coordinate of the point where the 
line crosses the y-axis. 


If b > 0, the line slopes upward to the right.
If b = 0, the line is horizontal.
If b < 0, the line slopes downward to the right.


[Figure: Three possible graphs of y = a + bx]


The slope is how much the response variable will increase/decrease by as 
the explanatory variable increases by 1. The y-intercept is the value of the 
response variable when the explanatory variable is 0. 


Example: 

A 

Svetlana tutors to make extra money for college. For each tutoring session,
she charges a one-time fee of $25 plus $15 per hour of tutoring. A linear
equation that expresses the total amount of money Svetlana earns for each
session she tutors is y = 25 + 15x.

Exercise: 

A 


Problem: 


What are the independent and dependent variables? What is the y- 
intercept and what is the slope? Interpret them using complete 
sentences. 


Solution: 


The independent variable (x) is the number of hours Svetlana tutors 
each session. The dependent variable (y) is the amount, in dollars, 
Svetlana earns for each session. 


The y-intercept is 25 (a = 25). At the start of the tutoring session, 
Svetlana charges a one-time fee of $25 (this is when x = 0). The slope 
is 15 (b = 15). For each session, Svetlana earns $15 for each hour she 
tutors. 


Example: 

B 

The equation to predict your college GPA given your high school GPA is
y = 1.11 + 0.66x, where x is your high school GPA.

Exercise: 

B 


Problem: 


What are the explanatory and response variables? Interpret the slope 
and y-intercept using complete sentences. 


Solution: 


The explanatory variable is high school GPA. The response variable is
college GPA. The slope is 0.66; for each additional point on your high
school GPA your college GPA should increase by 0.66 points. The
y-intercept is 1.11; a person with a high school GPA of 0 is expected to
have a college GPA of 1.11.


Linear Regression and Correlation: Scatter Plots 

This module provides an overview of Linear Regression and Correlation: 
Scatter Plots as a part of Collaborative Statistics collection (col10522) by 
Barbara Illowsky and Susan Dean. 


Before we take up the discussion of linear regression and correlation, we
need to examine a way to display the relation between two variables x and
y. The most common and easiest way is a scatter plot. The following
example illustrates a scatter plot.


Example: 

From an article in the Wall Street Journal: In Europe and Asia,
m-commerce is popular. M-commerce users have special mobile phones that
work like electronic wallets as well as provide phone and Internet services.
Users can do everything from paying for parking to buying a TV set or
soda from a machine to banking to checking sports scores on the Internet.
For the years 2000 through 2004, was there a relationship between the year
and the number of m-commerce users? Construct a scatter plot. Let x = the
year and let y = the number of m-commerce users, in millions.


Table showing the number of m-commerce users (in millions) by year.

x (year)    y (# of users)
2000        0.5
2002        20.0
2003        33.0
2004        47.0

[Figure: Scatter plot showing the number of m-commerce users (in millions) by year. Horizontal axis: x = year.]


A scatter plot shows the direction/trend and strength of a relationship 
between the variables. A clear direction happens when there is either: 


• High values of one variable occurring with high values of the other
variable or low values of one variable occurring with low values of the
other variable. (a positive trend)

• High values of one variable occurring with low values of the other
variable. (a negative trend)


You can determine the strength of the relationship by looking at the scatter 
plot and seeing how close the points are to a line, a power function, an 
exponential function, or to some other type of function. 


When you look at a scatterplot, you want to notice the overall pattern and 
any deviations from the pattern. The following scatterplot examples 
illustrate these concepts. 

[Figure: example scatter plots. Panels: Positive Linear Pattern (Strong); Linear Pattern w/ One Deviation; Exponential Growth Pattern; No Pattern.]


In this chapter, we are interested in scatter plots that show a linear pattern.
Linear patterns are quite common. The linear relationship is strong if the
points are close to a straight line. If we think that the points show a linear
relationship, we would like to draw a line on the scatter plot. This line can
be calculated through a process called linear regression. However, we only
calculate a regression line if one of the variables helps to explain or predict
the other variable. If x is the independent variable and y the dependent
variable, then we can use a regression line to predict y for a given value of
x.

Linear Regression and Correlation: The Regression Equation 

Linear Regression and Correlation: The Regression Equation is a part of 
Collaborative Statistics collection (col10522) by Barbara Illowsky and 
Susan Dean. Contributions from Roberta Bloom include instructions for 
finding and graphing the regression equation and scatterplot using the 
LinRegTTest on the TI-83,83+,84+ calculators. 


Data rarely fit a straight line exactly. Usually, you must be satisfied with 
rough predictions. Typically, you have a set of data whose scatter plot 
appears to "fit" a straight line. This is called a Line of Best Fit or Least 
Squares Line. 


Optional Collaborative Classroom Activity 


If you know a person's pinky (smallest) finger length, do you think you
could predict that person's height? Collect data from your class (pinky
finger length, in inches). The independent variable, x, is pinky finger length
and the dependent variable, y, is height.


For each set of data, plot the points on graph paper. Make your graph big 
enough and use a ruler. Then "by eye" draw a line that appears to "fit" the 
data. For your line, pick two convenient points and use them to find the 
slope of the line. Find the y-intercept of the line by extending your lines so 
they cross the y-axis. Using the slopes and the y-intercepts, write your 
equation of "best fit". Do you think everyone will have the same equation? 
Why or why not? 


Using your equation, what is the predicted height for a pinky length of 2.5 
inches? 


Example: 

A random sample of 11 statistics students produced the following data
where x is the third exam score, out of 80, and y is the final exam score,
out of 200. Can you predict the final exam score of a random student if you
know the third exam score?


Table showing the scores on the final exam based on scores from the third exam.

x (third exam score)    y (final exam score)
65                      175
67                      133
71                      185
71                      163
66                      126
75                      198
67                      153
70                      163
71                      159
69                      151
69                      159

[Figure: Scatter plot showing the scores on the final exam based on scores from the third exam. Horizontal axis: Third Exam Score.]


The third exam score, x, is the independent variable and the final exam
score, y, is the dependent variable. We will plot a regression line that best
"fits" the data. If each of you were to fit a line "by eye", you would draw
different lines. We can use what is called a least-squares regression line to
obtain the best fit line.


Consider the following diagram. Each point of data is of the form (x, y)
and each point of the line of best fit using least-squares linear regression has
the form (x, ŷ).


The ŷ is read "y hat" and is the estimated value of y. It is the value of y
obtained using the regression line. It is not generally equal to y from data.


[Figure: a data point (x_0, y_0), the corresponding point on the line (x_0, ŷ_0), and the vertical distance |y_0 - ŷ_0| = |ε_0| between them.]


The term y_0 - ŷ_0 = ε_0 is called the "error" or residual. It is not an error
in the sense of a mistake. The absolute value of a residual measures the 
vertical distance between the actual value of y and the estimated value of y. 
In other words, it measures the vertical distance between the actual data 
point and the predicted point on the line. 


If the observed data point lies above the line, the residual is positive, and 
the line underestimates the actual data value for y. If the observed data 
point lies below the line, the residual is negative, and the line overestimates 
that actual data value for y. 


In the diagram above, y_0 - ŷ_0 = ε_0 is the residual for the point shown. Here
the point lies above the line and the residual is positive.


ε = the Greek letter epsilon


For each data point, you can calculate the residuals or errors, y_i - ŷ_i = ε_i
for i = 1, 2, 3, ..., 11.


Each |ε| is a vertical distance.


For the example about the third exam scores and the final exam scores for
the 11 statistics students, there are 11 data points. Therefore, there are 11 ε
values. If you square each ε and add, you get


(\epsilon_1)^2 + (\epsilon_2)^2 + \dots + (\epsilon_{11})^2 = \sum_{i=1}^{11} \epsilon_i^2


This is called the Sum of Squared Errors (SSE). 
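
To make the residual and SSE definitions concrete, here is a minimal Python sketch that computes the residuals and the SSE for the 11 exam-score pairs listed in the example. The function names residuals and sse are illustrative, and the two candidate lines are assumptions chosen for comparison: the first uses the rounded coefficients this section reports later for the best fit line, and the second is an arbitrary line of the kind someone might draw "by eye".

```python
# Third exam (x) and final exam (y) scores for the 11 students in the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]


def residuals(a, b, xs, ys):
    """Return the residuals y_i - y_hat_i for the line y_hat = a + b*x."""
    return [yi - (a + b * xi) for xi, yi in zip(xs, ys)]


def sse(a, b, xs, ys):
    """Return the Sum of Squared Errors for the line y_hat = a + b*x."""
    return sum(e ** 2 for e in residuals(a, b, xs, ys))


# Two candidate lines (both are assumptions made for this illustration).
sse_line_1 = sse(-173.51, 4.83, x, y)   # rounded best fit coefficients
sse_line_2 = sse(-100.0, 3.8, x, y)     # an arbitrary "by eye" line

print(f"SSE for y_hat = -173.51 + 4.83x: {sse_line_1:.1f}")
print(f"SSE for y_hat = -100.0 + 3.8x:   {sse_line_2:.1f}")
```

The line with the smaller SSE is the better fit in the least-squares sense. Running the sketch should show the first line with the smaller value.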


Using calculus, you can determine the values of a and b that make the SSE 
a minimum. When you make the SSE a minimum, you have determined the 
points that are on the line of best fit. It turns out that the line of best fit has 
the equation: 


Equation:
\hat{y} = a + bx
where a = \bar{y} - b \bar{x} and b = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2}


x̄ and ȳ are the sample means of the x values and the y values, respectively.
The best fit line always passes through the point (x̄, ȳ).


The slope b can be written as b = r · (s_y / s_x) where s_y = the standard
deviation of the y values and s_x = the standard deviation of the x values. r
is the correlation coefficient, which is discussed in the next section.
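
The formulas for a and b can be checked directly on the third exam/final exam data. The Python sketch below is one possible way to do it, using only the standard library. The variable names (s_xy, s_xx, s_yy, b_from_r) are illustrative, and r is computed here from the same sums of deviations only so that the identity b = r · (s_y / s_x) can be verified.

```python
from statistics import mean, stdev

# Third exam (x) and final exam (y) scores for the 11 students in the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

x_bar, y_bar = mean(x), mean(y)

# Sums of cross-deviations and squared deviations about the means.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares slope and intercept.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Alternative form of the slope: b = r * (s_y / s_x),
# where stdev() is the sample standard deviation.
r = s_xy / (s_xx * s_yy) ** 0.5
b_from_r = r * stdev(y) / stdev(x)

print(f"a = {a:.3f}, b = {b:.4f}")
print(f"b from r * (s_y / s_x) = {b_from_r:.4f}")
print(f"line at x_bar: {a + b * x_bar:.3f}, y_bar: {y_bar:.3f}")
```

The last line of output illustrates that the best fit line passes through the point of means.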


Least Squares Criteria for Best Fit 

The process of fitting the best fit line is called linear regression. The idea
behind finding the best fit line is based on the assumption that the data are
scattered about a straight line. The criterion for the best fit line is that the
sum of the squared errors (SSE) is minimized, that is, made as small as
possible. Any other line you might choose would have a higher SSE than
the best fit line. This best fit line is called the least squares regression line.


Note: Computer spreadsheets, statistical software, and many calculators
can quickly calculate the best fit line and create the graphs. The 
calculations tend to be tedious if done by hand. Instructions to use the TI- 
83, TI-83+, and TI-84+ calculators to find the best fit line and create a 
scatterplot are shown at the end of this section. 


THIRD EXAM vs FINAL EXAM EXAMPLE: 
The graph of the line of best fit for the third exam/final exam example is 
shown below: 


[Figure: scatter plot of the third exam/final exam data with the line of best fit. Horizontal axis: Third Exam Score.]


The least squares regression line (best fit line) for the third exam/final exam 
example has the equation: 
Equation: 


\hat{y} = -173.51 + 4.83x


Note: 


• Remember, it is always important to plot a scatter diagram first. If the
scatter plot indicates that there is a linear relationship between the
variables, then it is reasonable to use a best fit line to make
predictions for y given x within the domain of x-values in the sample
data, but not necessarily for x-values outside that domain.

• You could use the line to predict the final exam score for a student
who earned a grade of 73 on the third exam.

• You should NOT use the line to predict the final exam score for a
student who earned a grade of 50 on the third exam, because 50 is not
within the domain of the x-values in the sample data, which are
between 65 and 75.


UNDERSTANDING SLOPE 

The slope of the line, b, describes how changes in the variables are related. 
It is important to interpret the slope of the line in the context of the situation 
represented by the data. You should be able to write a sentence interpreting 
the slope in plain English. 


INTERPRETATION OF THE SLOPE: The slope of the best fit line tells 
us how the dependent variable (y) changes for every one unit increase in the 
independent (x) variable, on average. 

THIRD EXAM vs FINAL EXAM EXAMPLE 


• Slope: The slope of the line is b = 4.83.

• Interpretation: For a one point increase in the score on the third exam,
the final exam score increases by 4.83 points, on average.


Using the TI-83+ and TI-84+ Calculators 
Using the Linear Regression T Test: LinRegTTest


In the STAT list editor, enter the X data in list L1 and the Y data in list L2, 
paired so that the corresponding (x,y) values are next to each other in the 
lists. (If a particular pair of values is repeated, enter it as many times as it 
appears in the data.) 

On the STAT TESTS menu, scroll down with the cursor to select the
LinRegTTest. (Be careful to select LinRegTTest as some calculators may
also have a different item called LinRegTInt.)

On the LinRegTTest input screen enter: Xlist: L1 ; Ylist: L2 ; Freq: 1 

On the next line, at the prompt β or ρ, highlight "≠ 0" and press ENTER.
Leave the line for "RegEQ:" blank.

Highlight Calculate and press ENTER. 


LinRegTTest Input Screen and Output Screen (TI-83+ and TI-84+ calculators)


Input screen:

LinRegTTest
Xlist: L1
Ylist: L2
Freq: 1
β or ρ: [≠0] <0 >0
RegEQ:
Calculate


Output screen:

LinRegTTest
y=a+bx
β≠0 and ρ≠0
t = 2.657560155
p = .0261501512
df = 9
a = -173.513363
b = 4.827394209
s = 16.41237711
r² = .4396931104
r = .663093591


The output screen contains a lot of information. For now we will focus on a 
few items from the output, and will return later to the other items. 


• The second line says y=a+bx. Scroll down to find the values
a=-173.513, and b=4.8273; the equation of the best fit line is
ŷ = -173.51 + 4.83x

• The two items at the bottom are r² = .43969 and r=.663. For now, just
note where to find these values; we will discuss them in the next two
sections.


Graphing the Scatterplot and Regression Line 


We are assuming your X data is already entered in list L1 and your Y data is 
in list L2 

Press 2nd STATPLOT ENTER to use Plot 1 

On the input screen for PLOT 1, highlight On and press ENTER

For TYPE: highlight the very first icon which is the scatterplot and press 
ENTER 

Indicate Xlist: L1 and Ylist: L2 

For Mark: it does not matter which symbol you highlight. 

Press the ZOOM key and then the number 9 (for menu item "ZoomStat") ; 
the calculator will fit the window to the data 

To graph the best fit line, press the "Y=" key and type the equation 
-173.5+4.83X into equation Y1. (The X key is immediately left of the STAT 
key). Press ZOOM 9 again to graph it. 

Optional: If you want to change the viewing window, press the WINDOW 
key. Enter your desired window using Xmin, Xmax, Ymin, Ymax 


**With contributions from Roberta Bloom 


Linear Regression and Correlation: Correlation Coefficient and Coefficient 
of Determination 

Linear Regression and Correlation: The Correlation Coefficient and 
Coefficient of Determination is a part of Collaborative Statistics collection 
(col10522) by Barbara Illowsky and Susan Dean with contributions from 
Roberta Bloom. The name has been changed from Correlation Coefficient. 


The Correlation Coefficient r 


Besides looking at the scatter plot and seeing that a line seems reasonable, 
how can you tell if the line is a good predictor? Use the correlation 
coefficient as another indicator (besides the scatterplot) of the strength of 
the relationship between x and y. 


The correlation coefficient, r, developed by Karl Pearson in the early 
1900s, is a numerical measure of the strength of association between the 
independent variable x and the dependent variable y. 


We will primarily use the calculator to find r, but the correlation coefficient 
is calculated as 
Equation: 


r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2] \cdot [n \sum y^2 - (\sum y)^2]}}


where n = the number of data points. 
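
The formula for r can be evaluated step by step from five sums. The Python sketch below is a minimal illustration, assuming the third exam/final exam data from the previous section. The function name correlation is a hypothetical helper, not a calculator or library function.

```python
from math import sqrt


def correlation(xs, ys):
    """Return the correlation coefficient r using the computational formula."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(xi * yi for xi, yi in zip(xs, ys))
    sum_x2 = sum(xi ** 2 for xi in xs)
    sum_y2 = sum(yi ** 2 for yi in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


# Third exam (x) and final exam (y) scores for the 11 students in the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

r = correlation(x, y)
print(f"r = {r:.4f}")
```

The printed value can be compared with the r reported on the LinRegTTest output screen.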


If you suspect a linear relationship between x and y, then r can measure
how strong the linear relationship is.
What the VALUE of r tells us:


• The value of r is always between -1 and +1: -1 ≤ r ≤ 1.

• The size of the correlation r indicates the strength of the linear
relationship between x and y. Values of r close to -1 or to +1 indicate a
stronger linear relationship between x and y.

• If r = 0 there is absolutely no linear relationship between x and y (no
linear correlation).

• If r = 1, there is perfect positive correlation. If r = -1, there is
perfect negative correlation. In both these cases, all of the original data
points lie on a straight line. Of course, in the real world, this will not
generally happen.


What the SIGN of r tells us 


• A positive value of r means that when x increases, y tends to increase
and when x decreases, y tends to decrease (positive correlation).

• A negative value of r means that when x increases, y tends to decrease
and when x decreases, y tends to increase (negative correlation).

• The sign of r is the same as the sign of the slope, b, of the best fit line.


Note: Strong correlation does not suggest that x causes y or y causes x. We
say "correlation does not imply causation." For example, every person
who learned math in the 17th century is dead. However, learning math does
not necessarily cause death!


[Figure: Positive Correlation. A scatter plot showing data with a positive correlation. 0 < r < 1]

[Figure: Negative Correlation. A scatter plot showing data with a negative correlation. -1 < r < 0]

[Figure: Zero Correlation. A scatter plot showing data with zero correlation. r = 0]


The formula for r looks formidable. However, computer spreadsheets,
statistical software, and many calculators can quickly calculate r. The
correlation coefficient r is the bottom item in the output screens for the
LinRegTTest on the TI-83, TI-83+, or TI-84+ calculator (see previous
section for instructions).


The Coefficient of Determination 

r² is called the coefficient of determination. r² is the square of the
correlation coefficient, but is usually stated as a percent, rather than in
decimal form. r² has an interpretation in the context of the data:


• r², when expressed as a percent, represents the percent of variation in
the dependent variable y that can be explained by variation in the
independent variable x using the regression (best fit) line.

• 1-r², when expressed as a percent, represents the percent of variation
in y that is NOT explained by variation in x using the regression line.
This can be seen as the scattering of the observed data points about the
regression line.


Consider the third exam/final exam example introduced in the previous
section.


• The line of best fit is: ŷ = -173.51 + 4.83x

• The correlation coefficient is r = 0.6631

• The coefficient of determination is r² = 0.6631² = 0.4397

• Interpretation of r² in the context of this example:

• Approximately 44% of the variation (0.4397 is approximately 0.44) in
the final exam grades can be explained by the variation in the grades
on the third exam, using the best fit regression line.

• Therefore approximately 56% of the variation (1 - 0.44 = 0.56) in the
final exam grades can NOT be explained by the variation in the grades
on the third exam, using the best fit regression line. (This is seen as the
scattering of the points about the line.)
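
The "percent of variation explained" reading of r² can be checked numerically. The sketch below is an illustration under stated assumptions rather than a calculator procedure. It fits the least-squares line to the 11 exam-score pairs, then compares r² with 1 - SSE/SST. SST (the total sum of squared deviations of y about its mean) is a name introduced here for illustration and does not appear in the text above.

```python
# Third exam (x) and final exam (y) scores for the 11 students in the example.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares line and correlation coefficient.
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / (s_xx * s_yy) ** 0.5

# Variation left unexplained by the line (SSE) versus total variation in y (SST).
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
sst = s_yy  # hypothetical name: total sum of squares of y about its mean

r_squared = r ** 2
explained = 1 - sse / sst

print(f"r^2         = {r_squared:.4f}")
print(f"1 - SSE/SST = {explained:.4f}")
print(f"unexplained = {sse / sst:.4f}")
```

The first two printed values should agree, which is why r² can be read as the share of the variation in y accounted for by the line.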


**With contributions from Roberta Bloom. 


Glossary 


Coefficient of Correlation 
A measure developed by Karl Pearson (early 1900s) that gives the 
strength of association between the independent variable and the 
dependent variable. The formula is: 


Equation:


r = \frac{n \sum xy - (\sum x)(\sum y)}{\sqrt{[n \sum x^2 - (\sum x)^2] \cdot [n \sum y^2 - (\sum y)^2]}}

where n is the number of data points. The coefficient cannot be more
than 1 or less than -1. The closer the coefficient is to +1 or -1, the stronger
the evidence of a significant linear relationship between x and y.


Linear Regression and Correlation: Summary of the Correlation Coefficient 
for Linear Regression 

This module provides an overview of Facts About the Correlation 
Coefficient for Linear Regression as a part of Collaborative Statistics 
collection (col10522) by Barbara Illowsky and Susan Dean. 


• A positive r means that when x increases, y increases and when x
decreases, y decreases (positive correlation).

• A negative r means that when x increases, y decreases and when x
decreases, y increases (negative correlation).

• An r of zero means there is absolutely no linear relationship between
x and y (no correlation).

• High correlation does not suggest that x causes y or y causes x. We
say "correlation does not imply causation." For example, every
person who learned math in the 17th century is dead. However,
learning math does not necessarily cause death!


[Figure: Positive Correlation. A scatter plot showing data with a positive correlation.]

[Figure: Negative Correlation. A scatter plot showing data with a negative correlation.]

[Figure: Zero Correlation. A scatter plot showing data with zero correlation.]


If r = -1 or r = +1, then all the data points lie exactly on a straight line.


If the linear correlation is strong, then the line can be used to predict a
y value.


Linear Regression and Correlation: Prediction 

Linear Regression and Correlation: Prediction is a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean with 
contributions from Roberta Bloom. 


Recall the third exam/final exam example. 


We examined the scatterplot and showed that the correlation coefficient is 
significant. We found the equation of the best fit line for the final exam 
grade as a function of the grade on the third exam. We can now use the least 
squares regression line for prediction. 


Suppose you want to estimate, or predict, the final exam score of statistics
students who received 73 on the third exam. The exam scores (x-values)
range from 65 to 75. Since 73 is between the x-values 65 and 75,
substitute x = 73 into the equation. Then:

Equation:

\hat{y} = -173.51 + 4.83(73) = 179.08

We predict that statistics students who earn a grade of 73 on the third exam
will earn a grade of 179.08 on the final exam, on average.


Example: 
Recall the third exam/final exam example. 
Exercise: 


Problem: 


What would you predict the final exam score to be for a student who 
scored a 66 on the third exam? 


Solution: 


145.27 


Exercise: 


Problem: 


What would you predict the final exam score to be for a student who 
scored a 90 on the third exam? 


Solution: 


The x values in the data are between 65 and 75. 90 is outside of the 
domain of the observed x values in the data (independent variable), so 
you cannot reliably predict the final exam score for this student. (Even 
though it is possible to enter x into the equation and calculate a y 
value, you should not do so!) 


To really understand how unreliable the prediction can be outside of
the observed x values in the data, make the substitution x = 90 into the
equation:

ŷ = -173.51 + 4.83(90) = 261.19


The final exam score is predicted to be 261.19. The largest the final 
exam score can be is 200. 


Note: The process of predicting inside of the observed x values in the
data is called interpolation. The process of predicting outside of the 
observed x values in the data is called extrapolation. 
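
The difference between interpolation and extrapolation can be built into a small prediction helper. The Python sketch below is one possible way to do it, assuming the rounded best fit line ŷ = -173.51 + 4.83x and the observed x range of 65 to 75 from this example. The function name predict_final and its return format are illustrative choices.

```python
# Rounded best fit line from the third exam/final exam example.
A = -173.51   # y-intercept
B = 4.83      # slope

# Smallest and largest third exam scores observed in the sample data.
X_MIN, X_MAX = 65, 75


def predict_final(third_exam):
    """Return (predicted final exam score, kind of prediction).

    kind is "interpolation" when the third exam score lies inside the
    observed x values and "extrapolation" when it lies outside them.
    """
    y_hat = A + B * third_exam
    if X_MIN <= third_exam <= X_MAX:
        kind = "interpolation"
    else:
        kind = "extrapolation"
    return y_hat, kind


for score in (73, 66, 90):
    y_hat, kind = predict_final(score)
    print(f"x = {score}: y_hat = {y_hat:.2f} ({kind})")
```

A prediction labeled "extrapolation" should be treated as unreliable, as the x = 90 case shows: the computed value exceeds the maximum possible final exam score of 200.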


**With contributions from Roberta Bloom 


Linear Regression and Correlation: Outliers 

Linear Regression and Correlation: Outliers is a part of Collaborative 
Statistics collection (col10522) by Barbara Illowsky and Susan Dean. The 
module has been modified to include a graphical method for identifying 
outliers contributed by Roberta Bloom. 


In some data sets, there are values (observed data points) called outliers. 
Outliers are observed data points that are far from the least squares 
line. They have large "errors", where the "error" or residual is the vertical 
distance from the line to the point. 


Outliers need to be examined closely. Sometimes, for some reason or 
another, they should not be included in the analysis of the data. It is possible 
that an outlier is a result of erroneous data. Other times, an outlier may hold 
valuable information about the population under study and should remain 
included in the data. The key is to carefully examine what causes a data 
point to be an outlier. 


Besides outliers, a sample may contain one or a few points that are called 
influential points. Influential points are observed data points that are far 
from the other observed data points in the horizontal direction. These points 
may have a big effect on the slope of the regression line. To begin to 
identify an influential point, you can remove it from the data set and see if 
the slope of the regression line is changed significantly. 


Computers and many calculators can be used to identify outliers from the 
data. Computer output for regression analysis will often identify both 
outliers and influential points so that you can examine them. 


Identifying Outliers 

We could guess at outliers by looking at a graph of the scatterplot and best
fit line. However, we would like some guideline as to how far away a point
needs to be in order to be considered an outlier. As a rough rule of thumb,
we can flag any point that is located farther than two standard
deviations above or below the best fit line as an outlier. The standard
deviation used is the standard deviation of the residuals or errors.


We can do this visually in the scatterplot by drawing an extra pair of lines 
that are two standard deviations above and below the best fit line. Any data 
points that are outside this extra pair of lines are flagged as potential 
outliers. Or we can do this numerically by calculating each residual and 
comparing it to twice the standard deviation. On the TI-83, 83+, or 84+, the 
graphical approach is easier. The graphical procedure is shown first, 
followed by the numerical calculations. You would generally only need to 
use one of these methods. 


Example: 
Exercise: 


Problem: 


In the third exam/final exam example, you can determine if there is an 
outlier or not. If there is an outlier, as an exercise, delete it and fit the 
remaining data to a new line. For this example, the new line ought to 
fit the remaining data better. This means the SSE should be smaller 
and the correlation coefficient ought to be closer to 1 or -1. 


Solution: 


Graphical Identification of Outliers 

With the TI-83,83+,84+ graphing calculators, it is easy to identify the 
outlier graphically and visually. If we were to measure the vertical 
distance from any data point to the corresponding point on the line of 
best fit and that distance was equal to 2s or farther, then we would 
consider the data point to be "too far" from the line of best fit. We 
need to find and graph the lines that are two standard deviations 
below and above the regression line. Any points that are outside these 
two lines are outliers. We will call these lines Y2 and Y3: 


As we did with the equation of the regression line and the correlation
coefficient, we will use technology to calculate this standard deviation
for us. Using the LinRegTTest with this data, scroll down through the
output screens to find s=16.412.


Line Y2=-173.5+4.83x-2(16.4) and line Y3=-173.5+4.83x+2(16.4)


where ŷ=-173.5+4.83x is the line of best fit. Y2 and Y3 have the same
slope as the line of best fit.


Graph the scatterplot with the best fit line in equation Y1, then enter 
the two extra lines as Y2 and Y3 in the "Y="equation editor and press 
ZOOM 9. You will find that the only data point that is not between 
lines Y2 and Y3 is the point x=65, y=175. On the calculator screen it 
is just barely outside these lines. The outlier is the student who had a 
grade of 65 on the third exam and 175 on the final exam; this point is 
further than 2 standard deviations away from the best fit line. 


Sometimes a point is so close to the lines used to flag outliers on the 
graph that it is difficult to tell if the point is between or outside the 
lines. On a computer, enlarging the graph may help; on a small 
calculator screen, zooming in may make the graph clearer. Note that 
when the graph does not give a clear enough picture, you can use the 
numerical comparisons to identify outliers. 



Numerical Identification of Outliers 

In the table below, the first two columns are the third exam and final 
exam data. The third column shows the predicted ŷ values calculated 
from the line of best fit: ŷ=-173.5+4.83x. The residuals, or errors, 
have been calculated in the fourth column of the table: 

observed y value - predicted y value = y - ŷ. 


s is the standard deviation of all the y - ŷ = ε values, where n = the 
total number of data points. If each residual is calculated and squared, 
and the results are added, we get the SSE. The standard deviation of 
the residuals is calculated from the SSE as: 


$s = \sqrt{\dfrac{SSE}{n-2}}$ 


Rather than calculate the value of s ourselves, we can find s using the 
computer or calculator. For this example, the calculator function 


LinRegTTest found s = 16.4 as the standard deviation of the residuals 
35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1. 


x     y     ŷ     y - ŷ 

65    175   140   175 - 140 = 35 
67    133   150   133 - 150 = -17 
71    185   169   185 - 169 = 16 
71    163   169   163 - 169 = -6 
66    126   145   126 - 145 = -19 
75    198   189   198 - 189 = 9 
67    153   150   153 - 150 = 3 
70    163   164   163 - 164 = -1 
71    159   169   159 - 169 = -10 
69    151   160   151 - 160 = -9 
69    159   160   159 - 160 = -1 


We are looking for all data points for which the residual is greater 
than 2s=2(16.4)=32.8 or less than -32.8. Compare these values to the 
residuals in column 4 of the table. The only such data point is the 
student who had a grade of 65 on the third exam and 175 on the final 
exam; the residual for this student is 35. 
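

The numerical check can also be carried out without a calculator. The 
following is a minimal Python sketch, not the calculator's own routine: it 
fits the least-squares line to the 11 data points from the table, computes 
the residuals and s, and flags any point whose residual is more than 2s 
from the line. The variable names are illustrative assumptions. 

```python
import math

# Third exam (x) and final exam (y) scores from the table above.
x = [65, 67, 71, 71, 66, 75, 67, 70, 71, 69, 69]
y = [175, 133, 185, 163, 126, 198, 153, 163, 159, 151, 159]
n = len(x)

# Least-squares slope b and intercept a for y-hat = a + b*x.
x_bar = sum(x) / n
y_bar = sum(y) / n
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = s_xy / s_xx
a = y_bar - b * x_bar

# Residuals (observed minus predicted), SSE, and s = sqrt(SSE / (n - 2)).
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sse = sum(e ** 2 for e in residuals)
s = math.sqrt(sse / (n - 2))

# A point is flagged when its residual is more than 2s from the line.
outliers = [(xi, yi) for xi, yi, e in zip(x, y, residuals) if abs(e) > 2 * s]

print(f"y-hat = {a:.2f} + {b:.2f}x, s = {s:.2f}")
print("outliers:", outliers)
```

Because this sketch uses unrounded predicted values, its residuals differ 
slightly from the rounded ones in the table (for example, about 34.7 
instead of 35 for the first point), but the same point is flagged. 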


How does the outlier affect the best fit line? 

Numerically and graphically, we have identified the point (65,175) as 
an outlier. We should re-examine the data for this point to see if there 
are any problems with the data. If there is an error we should fix the 
error if possible, or delete the data. If the data is correct, we would 
leave it in the data set. For this problem, we will suppose that we 
examined the data and found that this outlier data was an error. 
Therefore we will continue on and delete the outlier, so that we 
can explore how it affects the results, as a learning experience. 


Compute a new best-fit line and correlation coefficient using the 
10 remaining points: 

On the TI-83, TI-83+, TI-84+ calculators, delete the outlier from L1 
and L2. Using the LinRegTTest, the new line of best fit and the 
correlation coefficient are: 


ŷ = -355.19 + 7.39x and r = 0.9121 


The new value r = 0.9121 indicates a stronger correlation than the 
original (r=0.6631) because r = 0.9121 is closer to 1. This means 
that the new line is a better fit to the 10 remaining data values. The 
line can better predict the final exam score given the third exam score. 
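

As a rough check of the comparison above, the sketch below fits the line 
and computes r twice: once on all 11 points and once with (65, 175) 
removed. The helper name `fit` is an assumption made for this 
illustration; it is not a calculator or library function. 

```python
import math

def fit(points):
    """Return (a, b, r) for the least-squares line y-hat = a + b*x."""
    n = len(points)
    x_bar = sum(px for px, _ in points) / n
    y_bar = sum(py for _, py in points) / n
    s_xx = sum((px - x_bar) ** 2 for px, _ in points)
    s_yy = sum((py - y_bar) ** 2 for _, py in points)
    s_xy = sum((px - x_bar) * (py - y_bar) for px, py in points)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    r = s_xy / math.sqrt(s_xx * s_yy)
    return a, b, r

# (third exam, final exam) pairs from the table.
data = [(65, 175), (67, 133), (71, 185), (71, 163), (66, 126), (75, 198),
        (67, 153), (70, 163), (71, 159), (69, 151), (69, 159)]

a_all, b_all, r_all = fit(data)

# Remove the flagged outlier and refit on the 10 remaining points.
trimmed = [p for p in data if p != (65, 175)]
a_new, b_new, r_new = fit(trimmed)

print(f"all 11 points: y-hat = {a_all:.2f} + {b_all:.2f}x, r = {r_all:.4f}")
print(f"outlier removed: y-hat = {a_new:.2f} + {b_new:.2f}x, r = {r_new:.4f}")
```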


Numerical Identification of Outliers: Calculating s and Finding 
Outliers Manually 


If you do not have the function LinRegTTest, then you can identify the 
outlier in the first example by doing the following. 


First, square each |y - ŷ| (see the table above): 
The squares are 35², 17², 16², 6², 19², 9², 3², 1², 10², 9², 1². 


Then, add (sum) all the |y - ŷ| squared terms using the formula 


$\sum_{i=1}^{11} \left( |y_i - \hat{y}_i| \right)^2 = \sum_{i=1}^{11} \varepsilon_i^2$ (Recall that $y_i - \hat{y}_i = \varepsilon_i$.) 


= 35² + 17² + 16² + 6² + 19² + 9² + 3² + 1² + 10² + 9² + 1² 


= 2440 = SSE. The result, SSE, is the Sum of Squared Errors. 


Next, calculate s, the standard deviation of all the y - ŷ = ε values, 
where n = the total number of data points. 


The calculation is $s = \sqrt{\dfrac{SSE}{n-2}}$. 


For the third exam/final exam problem, $s = \sqrt{\dfrac{2440}{11-2}} = 16.47$. 


Next, multiply s by 1.9: 

(1.9) - (16.47) = 31.29 

31.29 is almost 2 standard deviations away from the mean of the y - ŷ 
values. 


If we were to measure the vertical distance from any data point to the 
corresponding point on the line of best fit and that distance is at least 1.9s, 
then we would consider the data point to be "too far" from the line of best 
fit. We call that point a potential outlier. 


For the example, if any of the |y - ŷ| values are at least 31.29, the 
corresponding (x, y) data point is a potential outlier. 


For the third exam/final exam problem, all the |y - ŷ| values are less 
than 31.29 except for the first one, which is 35. 


35 > 31.29. That is, |y - ŷ| ≥ (1.9)(s). 


The point which corresponds to |y - ŷ| = 35 is (65, 175). Therefore, the 
data point (65, 175) is a potential outlier. For this example, we will delete 
it. (Remember, we do not always delete an outlier.) 
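

The manual steps can be sketched in a few lines of Python. This is only an 
illustration of the arithmetic, using the rounded residuals from the table; 
the variable names are assumptions. 

```python
import math

# Rounded residuals y - y-hat from the table, in row order.
residuals = [35, -17, 16, -6, -19, 9, 3, -1, -10, -9, -1]
n = len(residuals)

# Step 1 and 2: square each residual and add the squares to get the SSE.
sse = sum(e ** 2 for e in residuals)

# Step 3: s = sqrt(SSE / (n - 2)).
s = math.sqrt(sse / (n - 2))

# Step 4: the cutoff used in this section is 1.9 * s.
cutoff = 1.9 * s

# Step 5: a residual at least as large as the cutoff marks a potential outlier.
flagged = [e for e in residuals if abs(e) >= cutoff]

print(f"SSE = {sse}, s = {s:.2f}, cutoff = {cutoff:.2f}, flagged = {flagged}")
```

The sketch keeps s unrounded, so its cutoff (about 31.28) differs in the 
last digit from the 31.29 obtained by rounding s to 16.47 first. The 
flagged residual is the same either way. 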


The next step is to compute a new best-fit line using the 10 remaining 
points. The new line of best fit and the correlation coefficient are: 


ŷ = -355.19 + 7.39x and r = 0.9121 


Example: 
Exercise: 


Problem: 


Using this new line of best fit (based on the remaining 10 data points), 
what would a student who receives a 73 on the third exam expect to 
receive on the final exam? Is this the same as the prediction made 
using the original line? 


Solution: 


Using the new line of best fit, ŷ = -355.19 + 7.39(73) = 184.28. A 
student who scored 73 points on the third exam would expect to earn 
184 points on the final exam. 


The original line predicted ŷ = -173.51 + 4.83(73) = 179.08, so 
the prediction using the new line with the outlier eliminated differs 
from the original prediction. 
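

The two predictions can be reproduced with a short sketch that evaluates 
each line at x = 73. The function name `predict` is an assumption for 
illustration, and the coefficients are the rounded values quoted above. 

```python
def predict(a, b, x):
    """Evaluate the line y-hat = a + b*x at a given x."""
    return a + b * x

new_pred = predict(-355.19, 7.39, 73)   # line fitted without the outlier
old_pred = predict(-173.51, 4.83, 73)   # original line, all 11 points

print(f"new line: {new_pred:.2f}, original line: {old_pred:.2f}")
print(f"difference: {new_pred - old_pred:.2f} points")
```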


Example: 

(From The Consumer Price Indexes Web site) The Consumer Price Index 
(CPI) measures the average change over time in the prices paid by urban 
consumers for consumer goods and services. The CPI affects nearly all 
Americans because of the many ways it is used. One of its biggest uses is 
as a measure of inflation. By providing information about price changes in 
the Nation's economy to government, business, and labor, the CPI helps 
them to make economic decisions. The President, Congress, and the 


Federal Reserve Board use the CPI's trends to formulate monetary and 
fiscal policies. In the following table, x is the year and y is the CPI. 


x       y 
1915    10.1 
1926    [illegible] 
1935    [illegible] 
1940    14.7 
1947    24.1 
1952    26.5 
1964    31.0 
1969    36.7 
1975    49.3 
1979    72.6 
1980    82.4 
1986    109.6 
1991    130.7 


Data: 


Exercise: 


Problem: 


• Make a scatterplot of the data. 
• Calculate the least squares line. Write the equation in the form 
  ŷ = a + bx. 
• Draw the line on the scatterplot. 
• Find the correlation coefficient. Is it significant? 
• What is the average CPI for the year 1990? 


Solution: 


• Scatter plot and line of best fit. 
• ŷ = -3204 + 1.662x is the equation of the line of best fit. 
• r = 0.8694 
• The number of data points is n = 14. Use the 95% Critical 
  Values of the Sample Correlation Coefficient table at the end of 
  Chapter 12. n - 2 = 12. The corresponding critical value is 
  0.532. Since 0.8694 > 0.532, r is significant. 
• ŷ = -3204 + 1.662(1990) = 103.4 CPI 
• Using the calculator LinRegTTest, we find that s = 25.4; 
  graphing the lines Y2=-3204+1.662X-2(25.4) and 
  Y3=-3204+1.662X+2(25.4) shows that no data values are outside 
  those lines, identifying no outliers. (Note that the year 1999 was 
  very close to the upper line, but still inside it.) 
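

The band check in the last bullet can be sketched numerically. The code 
below is an illustration under stated assumptions: it uses the rounded 
line and s = 25.4 from the solution, and only the 11 rows of the table 
whose CPI values are readable in this copy (the 1926 and 1935 values are 
garbled, and the 1999 point mentioned in the solution does not appear in 
the table, so all three are left out). 

```python
# Line of best fit and s as reported in the solution.
a, b, s = -3204, 1.662, 25.4

# Readable (year, CPI) rows from the table; garbled rows are omitted.
cpi = [(1915, 10.1), (1940, 14.7), (1947, 24.1), (1952, 26.5),
       (1964, 31.0), (1969, 36.7), (1975, 49.3), (1979, 72.6),
       (1980, 82.4), (1986, 109.6), (1991, 130.7)]

def predict(year):
    """Predicted CPI from the line y-hat = a + b*x."""
    return a + b * year

# A point lies outside the band Y2..Y3 when |y - y-hat| > 2s.
outside = [(yr, val) for yr, val in cpi if abs(val - predict(yr)) > 2 * s]

pred_1990 = predict(1990)
print(f"predicted CPI for 1990: {pred_1990:.1f}")
print("points outside the 2s band:", outside)
```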


[Figure: scatterplot of CPI versus Year with the line of best fit] 


Note: In the example, notice the pattern of the points compared to the line. 
Although the correlation coefficient is significant, the pattern in the 
scatterplot indicates that a curve would be a more appropriate model to 
use than a line. In this example, a statistician should prefer to use other 
methods to fit a curve to this data, rather than model the data with the line 
we found. In addition to doing the calculations, it is always important to 
look at the scatterplot when deciding whether a linear model is 
appropriate. 


If you are interested in seeing more years of data, visit the Bureau of Labor 
Statistics CPI website ftp://ftp.bls.gov/pub/special.requests/cpi/cpiai.txt; 
our data is taken from the column entitled "Annual Avg." (third column 
from the right). For example, you could add more current years of data. Try 
adding the more recent years 2004: CPI=188.9, 2008: CPI=215.3, and 
2011: CPI=224.9. See how it affects the model. (Check: 
ŷ = -4436 + 2.295x, r = 0.9018. Is r significant? Is the fit better with 
the addition of the new points?) 


**With contributions from Roberta Bloom 


Glossary 


Outlier 
An observation that does not fit the rest of the data. 


Linear Regression and Correlation: Summary 

This module provides a summary on Linear Regression and Correlation as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


Bivariate Data: Each data point has two values. The form is (x, y). 
Line of Best Fit or Least Squares Line (LSL): ŷ = a + bx 
x = independent variable; y = dependent variable 


Residual: Actual y value - predicted y value = y - ŷ 
Correlation Coefficient r: 


1. Used to determine whether a line of best fit is good for prediction. 

2. Between -1 and 1 inclusive. The closer r is to 1 or -1, the closer the 
original points are to a straight line. 

3. If r is negative, the slope is negative. If r is positive, the slope is 
positive. 

4. If r = 0, then the line is horizontal. 


Sum of Squared Errors (SSE): The smaller the SSE, the better the 
original set of points fits the line of best fit. 


Outlier: A point that does not seem to fit the rest of the data. 


Linear Regression and Correlation: Practice 

This module provides a practice of Linear Regression and Correlation as a 
part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 


Student Learning Outcomes 


e The student will evaluate bivariate data and determine if a line is an 
appropriate fit to the data. 


Given 


Below are real data for the first two decades of AIDS reporting. (Source: 
Centers for Disease Control and Prevention, National Center for HIV, STD, 
and TB Prevention) 


Year        # AIDS cases diagnosed    # AIDS deaths 
Pre-1981    91                        29 
1981        319                       121 
1982        1,170                     453 
1983        3,076                     1,482 
1984        6,240                     3,466 
1985        11,776                    6,878 
1986        19,032                    11,987 
1987        28,564                    16,162 
1988        35,447                    20,868 
1989        42,674                    27,591 
1990        48,634                    31,335 
1991        59,660                    36,560 
1992        78,530                    41,055 
1993        78,834                    44,730 
1994        71,874                    49,095 
1995        68,505                    49,456 
1996        59,347                    38,510 
1997        47,149                    20,736 
1998        38,393                    19,005 
1999        25,174                    18,454 
2000        25,522                    17,347 
2001        25,643                    17,402 
2002        26,464                    16,371 
Total       802,118                   489,093 


Adults and Adolescents only, United States 


Note: We will use the columns “year” and “# AIDS cases diagnosed” for 
all questions unless otherwise stated. 


Graphing 
Graph “year” vs. “# AIDS cases diagnosed.” Plot the points on the graph 
located below in the section titled "Plot". Do not include pre-1981. 
Label both axes with words. Scale both axes. 


Data 


Exercise: 


Problem: 
Enter your data into your calculator or computer. The pre-1981 data 
should not be included. Why is that so? 

Linear Equation 


Write the linear equation below, rounding to 4 decimal places: 


Note: For any prediction questions, the answers are calculated using the 
least squares (best fit) line equation cited in the solution. 


Exercise: 


Problem: Calculate the following: 


a. a = 
b. b = 
c. corr. = 
d. n = (# of pairs) 


Solution: 


a. a = -3,448,225 
b. b = 1750 
c. corr. = 0.4526 
d. n = 22 

Exercise: 


Problem: equation: ŷ = 


Solution: 


ŷ = -3,448,225 + 1750x 


Solve 


Exercise: 


Problem: Solve. 


a. When x = 1985, ŷ = 
b. When x = 1990, ŷ = 

Solution: 

a. 25,525 
b. 34,275 
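

For readers working on a computer rather than a calculator, the sketch 
below fits the least-squares line to the 1981 to 2002 rows of the table 
and then evaluates the rounded equation from the solution. It is an 
illustration, not the textbook's procedure. Three diagnosed-case values 
are hard to read in some copies of the table; the values used here for 
1996 and 2000 are the ones consistent with the Total row, which is an 
assumption worth checking against the original source. 

```python
import math

years = list(range(1981, 2003))  # pre-1981 is excluded, as instructed
cases = [319, 1170, 3076, 6240, 11776, 19032, 28564, 35447, 42674, 48634,
         59660, 78530, 78834, 71874, 68505, 59347, 47149, 38393, 25174,
         25522, 25643, 26464]
n = len(years)

# Least-squares slope b, intercept a, and correlation coefficient r.
x_bar = sum(years) / n
y_bar = sum(cases) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_yy = sum((y - y_bar) ** 2 for y in cases)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, cases))
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / math.sqrt(s_xx * s_yy)

print(f"n = {n}, a = {a:,.0f}, b = {b:,.0f}, r = {r:.4f}")

# Predictions using the rounded equation y-hat = -3,448,225 + 1750x.
pred_1985 = -3448225 + 1750 * 1985
pred_1990 = -3448225 + 1750 * 1990
print(pred_1985, pred_1990)
```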


Plot 


Plot the 2 above points on the graph below. Then, connect the 2 points to 
form the regression line. 


Obtain the graph on your calculator or computer. 


Discussion Questions 


Look at the graph above. 
Exercise: 


Problem: Does the line seem to fit the data? Why or why not? 


Exercise: 


Problem: Do you think a linear fit is best? Why or why not? 
Exercise: 

Problem: 

Hand draw a smooth curve on the graph above that shows the flow of 

the data. 


Exercise: 


Problem: 


What does the correlation imply about the relationship between time 
(years) and the number of diagnosed AIDS cases reported in the U.S.? 


Exercise: 


Problem: 


Why is “year” the independent variable and “# AIDS cases 
diagnosed” the dependent variable (instead of the reverse)? 


Exercise: 
Problem: Solve. 
a. When x = 1970, ŷ = 
b. Why doesn’t this answer make sense? 
Solution: 


a. -725 


Linear Regression and Correlation: Homework 

This module provides a homework for Linear Regression and Correlation as 
a part of Collaborative Statistics collection (col10522) by Barbara Illowsky 
and Susan Dean. 

Exercise: 


Problem: 


For each situation below, state the independent variable and the 
dependent variable. 


a. A study is done to determine if elderly drivers are involved in 
   more motor vehicle fatalities than all other drivers. The number of 
   fatalities per 100,000 drivers is compared to the age of drivers. 

b. A study is done to determine if the weekly grocery bill changes 
   based on the number of family members. 

c. Insurance companies base life insurance premiums partially on 
   the age of the applicant. 

d. Utility bills vary according to power consumption. 

e. A study is done to determine if a higher education reduces the 
   crime rate in a population. 


Solution: 


a. Independent: Age; Dependent: Fatalities 
d. Independent: Power Consumption; Dependent: Utility 


Exercise: 
Problem: 
In 1990 the number of driver deaths per 100,000 for the different age 
groups was as follows (Source: The National Highway Traffic Safety 
Administration's National Center for Statistics and Analysis): 


Age      Number of Driver Deaths per 100,000 
15-24    28 
25-39    15 
40-69    10 
70-79    15 
80+      25 


a. For each age group, pick the midpoint of the interval for the x 
   value. (For the 80+ group, use 85.) 

b. Using “ages” as the independent variable and “Number of driver 
   deaths per 100,000” as the dependent variable, make a scatter plot 
   of the data. 

c. Calculate the least squares (best-fit) line. Put the equation in the 
   form of: ŷ = a + bx 

d. Find the correlation coefficient. 

e. Pick two ages and find the estimated fatality rates. 

f. Use the two points in (e) to plot the least squares line on your 
   graph from (b). 

g. Based on the above data, is there a linear relationship between 
   age of a driver and driver fatality rate? 

h. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Exercise: 


Problem: 


The average number of people in a family that received welfare for 
various years is given below. (Source: House Ways and Means 
Committee, Health and Human Services Department) 


Year    Welfare family size 
1969    4.0 
1973    3.6 
1975    3.2 
1979    3.0 
1983    3.0 
1988    3.0 
1991    2.9 


a. Using “year” as the independent variable and “welfare family 
   size” as the dependent variable, make a scatter plot of the data. 

b. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

c. Find the correlation coefficient. 

d. Pick two years between 1969 and 1991 and find the estimated 
   welfare family sizes. 

e. Use the two points in (d) to plot the least squares line on your 
   graph from (b). 

f. Based on the above data, is there a linear relationship between 
   the year and the average number of people in a welfare family? 

g. Using the least squares line, estimate the welfare family sizes for 
   1960 and 1995. Does the least squares line give an accurate 
   estimate for those years? Explain why or why not. 

h. Are there any outliers in the above data? 

i. What is the estimated average welfare family size for 1986? 
   Does the least squares line give an accurate estimate for that year? 
   Explain why or why not. 

j. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Solution: 


b. ŷ = 88.7206 - 0.0432x 

c. -0.8533 

g. No 

h. No 

i. 2.97; Yes 

j. slope = -0.0432. As the year increases by one, the welfare family 
   size decreases by 0.0432 people. 


Exercise: 
Problem: 
Use the AIDS data from the practice for this section, but this time use 
the columns “year #” and “# new AIDS deaths in U.S.” Answer all of 
the questions from the practice again, using the new columns. 


Exercise: 
Problem: 
The height (sidewalk to roof) of notable tall buildings in America is 
compared to the number of stories of the building (beginning at street 
level). (Source: Microsoft Bookshelf) 


Height (in feet)    Stories 
1050                57 
428                 28 
362                 26 
529                 40 
790                 60 
401                 22 
380                 38 
1454                110 
1127                100 
700                 46 


a. Using “stories” as the independent variable and “height” as the 
   dependent variable, make a scatter plot of the data. 

b. Does it appear from inspection that there is a relationship 
   between the variables? 

c. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

d. Find the correlation coefficient. 

e. Find the estimated heights for 32 stories and for 94 stories. 

f. Use the two points in (e) to plot the least squares line on your 
   graph from (b). 

g. Based on the above data, is there a linear relationship between 
   the number of stories in tall buildings and the height of the 
   buildings? 

h. Are there any outliers in the above data? If so, which point(s)? 

i. What is the estimated height of a building with 6 stories? Does 
   the least squares line give an accurate estimate of height? Explain 
   why or why not. 

j. Based on the least squares line, adding an extra story adds about 
   how many feet to a building? 

k. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Solution: 


b. Yes 

c. ŷ = 102.4287 + 11.7585x 

d. 0.9436 

e. 478.70 feet; 1207.73 feet 

g. Yes 

h. Yes; (57, 1050) 

i. 172.98; No 

j. 11.7585 feet 

k. slope = 11.7585. As the number of stories increases by one, the 
   height of the building increases by 11.7585 feet. 
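

The answers to parts (c), (e), and (i) can be checked with a short 
sketch. It is illustrative only: the data are the ten (stories, height) 
pairs from the table, with the stories value for the 529-foot building 
taken to be 40 (an assumption where the printed value is hard to read, 
chosen because it agrees with the published least squares line). 

```python
# (stories, height in feet) pairs from the table.
buildings = [(57, 1050), (28, 428), (26, 362), (40, 529), (60, 790),
             (22, 401), (38, 380), (110, 1454), (100, 1127), (46, 700)]
n = len(buildings)

# Least-squares slope b and intercept a for y-hat = a + b*x.
x_bar = sum(x for x, _ in buildings) / n
y_bar = sum(y for _, y in buildings) / n
s_xx = sum((x - x_bar) ** 2 for x, _ in buildings)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in buildings)
b = s_xy / s_xx
a = y_bar - b * x_bar

def estimate(stories):
    """Estimated height in feet for a given number of stories."""
    return a + b * stories

print(f"y-hat = {a:.4f} + {b:.4f}x")
for stories in (32, 94, 6):
    print(f"{stories} stories: {estimate(stories):.2f} feet")
```

The estimate for 6 stories lies far below the range of the data (22 to 
110 stories), which is why the solution answers "No" to part (i). 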


Exercise: 


Problem: 


Below is the life expectancy for an individual born in the United States 
in certain years. (Source: National Center for Health Statistics) 


Year of Birth    Life Expectancy 
1930             59.7 
1940             62.9 
1950             70.2 
1965             69.7 
1973             71.4 
1982             74.5 
1987             75 
1992             [illegible] 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Draw a scatter plot of the ordered pairs. 

c. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

d. Find the correlation coefficient. 

e. Find the estimated life expectancy for an individual born in 1950 
   and for one born in 1982. 

f. Why aren’t the answers to part (e) the values on the above chart 
   that correspond to those years? 

g. Use the two points in (e) to plot the least squares line on your 
   graph from (b). 

h. Based on the above data, is there a linear relationship between 
   the year of birth and life expectancy? 

i. Are there any outliers in the above data? 

j. Using the least squares line, find the estimated life expectancy 
   for an individual born in 1850. Does the least squares line give an 
   accurate estimate for that year? Explain why or why not. 

k. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Exercise: 


Problem: 


The percent of female wage and salary workers who are paid hourly 
rates is given below for the years 1979 - 1992. (Source: Bureau of 
Labor Statistics, U.S. Dept. of Labor) 


Year Percent of workers paid hourly rates 
1979 61.2 
1980 60.7 
1981 61.3 
1982 61.3 
1983 61.8 
1984 61.7 
1985 61.8 
1986 62.0 
1987 62.7 
1990 62.8 
1992 62.9 


a. Using “year” as the independent variable and “percent” as the 
   dependent variable, make a scatter plot of the data. 

b. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

c. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

d. Find the correlation coefficient. 

e. Find the estimated percents for 1991 and 1988. 

f. Use the two points in (e) to plot the least squares line on your 
   graph from (b). 

g. Based on the above data, is there a linear relationship between 
   the year and the percent of female wage and salary earners who 
   are paid hourly rates? 

h. Are there any outliers in the above data? 

i. What is the estimated percent for the year 2050? Does the least 
   squares line give an accurate estimate for that year? Explain why 
   or why not. 

j. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 

Solution: 

b. Yes 

c. ŷ = -266.8863 + 0.1656x 

d. 0.9448 

e. 62.9206; 62.4237 

h. No 

i. 72.639; No 

j. slope = 0.1656. As the year increases by one, the percent of 
   workers paid hourly rates increases by 0.1656. 


Exercise: 
Problem: 


The maximum discount value of the Entertainment® card for the “Fine 
Dining” section, Edition 10, for various pages is given below. 


Page number Maximum value ($) 


4 16 
14 19 
25 15 
32 17 
43 19 
57 15 
72 16 
85 15 
90 17 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Draw a scatter plot of the ordered pairs. 

c. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

d. Find the correlation coefficient. 

e. Find the estimated maximum values for the restaurants on page 
   10 and on page 70. 

f. Use the two points in (e) to plot the least squares line on your 
   graph from (b). 

g. Does it appear that the restaurants giving the maximum value 
   are placed in the beginning of the “Fine Dining” section? How 
   did you arrive at your answer? 

h. Suppose that there were 200 pages of restaurants. What do you 
   estimate to be the maximum value for a restaurant listed on page 
   200? 

i. Is the least squares line valid for page 200? Why or why not? 

j. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


The next two questions refer to the following data: The cost of a leading 
liquid laundry detergent in different sizes is given below. 


Size (ounces)    Cost ($)    Cost per ounce 
16               3.99 
32               4.99 
64               5.99 
200              10.99 

Exercise: 
Problem: 


a. Using “size” as the independent variable and “cost” as the 
   dependent variable, make a scatter plot. 

b. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

c. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

d. Find the correlation coefficient. 

e. If the laundry detergent were sold in a 40 ounce size, find the 
   estimated cost. 

f. If the laundry detergent were sold in a 90 ounce size, find the 
   estimated cost. 

g. Use the two points in (e) and (f) to plot the least squares line on 
   your graph from (a). 

h. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

i. Are there any outliers in the above data? 

j. Is the least squares line valid for predicting what a 300 ounce 
   size of the laundry detergent would cost? Why or why not? 

k. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Solution: 


b. Yes 

c. ŷ = 3.5984 + 0.0371x 

d. 0.9986 

e. $5.08 

f. $6.93 

i. No 

j. Not valid 

k. slope = 0.0371. As the number of ounces increases by one, the 
   cost of the liquid detergent increases by $0.0371 (or about 4 
   cents). 


Exercise: 


Problem: 


a. Complete the above table for the cost per ounce of the different 
   sizes. 

b. Using “Size” as the independent variable and “Cost per ounce” 
   as the dependent variable, make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. 

f. If the laundry detergent were sold in a 40 ounce size, find the 
   estimated cost per ounce. 

g. If the laundry detergent were sold in a 90 ounce size, find the 
   estimated cost per ounce. 

h. Use the two points in (f) and (g) to plot the least squares line on 
   your graph from (b). 

i. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

j. Are there any outliers in the above data? 

k. Is the least squares line valid for predicting what a 300 ounce 
   size of the laundry detergent would cost per ounce? Why or why 
   not? 

l. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Exercise: 
Problem: 
According to a flyer by a Prudential Insurance Company representative, 
the costs of approximate probate fees and taxes for selected net taxable 
estates are as follows: 


Net Taxable Estate ($)    Approximate Probate Fees and Taxes ($) 
600,000                   30,000 
750,000                   92,500 
1,000,000                 203,000 
1,500,000                 438,000 
2,000,000                 688,000 
2,500,000                 1,037,000 
3,000,000                 1,350,000 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. 

f. Find the estimated total cost for a net taxable estate of 
   $1,000,000. Find the cost for $2,500,000. 

g. Use the two points in (f) to plot the least squares line on your 
   graph from (b). 

h. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

i. Are there any outliers in the above data? 

j. Based on the above, what would be the probate fees and taxes for 
   an estate that does not have any assets? 

k. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Solution: 


c. Yes 

d. ŷ = -337,424.6478 + 0.5463x 

e. 0.9964 

f. $208,872.49; $1,028,318.20 

h. Yes 

i. No 

k. slope = 0.5463. As the net taxable estate increases by one dollar, 
   the approximate probate fees and taxes increase by 0.5463 
   dollars (about 55 cents). 


Exercise: 


Problem: 


The following are advertised sale prices of color televisions at 
Anderson’s. 


Size (inches) Sale Price ($) 
9 147 

20 197 

27 297 

31 447 

35 1177 

40 2177 


60 2497 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. 

f. Find the estimated sale price for a 32 inch television. Find the 
   cost for a 50 inch television. 

g. Use the two points in (f) to plot the least squares line on your 
   graph from (b). 

h. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

i. Are there any outliers in the above data? 

j. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Exercise: 


Problem: 


Below are the average heights for American boys. (Source: Physician’s 
Handbook, 1990) 


Age (years)    Height (cm) 
birth          50.8 
2              83.8 
3              91.4 
5              106.6 
7              119.3 
10             137.1 
14             157.5 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. 

f. Find the estimated average height for a one-year-old. Find the 
   estimated average height for an eleven-year-old. 

g. Use the two points in (f) to plot the least squares line on your 
   graph from (b). 

h. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

i. Are there any outliers in the above data? 

j. Use the least squares line to estimate the average height for a 
   sixty-two-year-old man. Do you think that your answer is 
   reasonable? Why or why not? 

k. What is the slope of the least squares (best-fit) line? Interpret the 
   slope. 


Solution: 


c. Yes 

d. ŷ = 65.0876 + 7.0948x 

e. 0.9761 

f. 72.2 cm; 143.13 cm 

h. Yes 

i. No 

j. 505.0 cm; No 

k. slope = 7.0948. As the age of an American boy increases by one 
   year, the average height increases by 7.0948 cm. 


Exercise: 


Problem: 


The following chart gives the gold medal times for every other 
Summer Olympics for the women’s 100 meter freestyle (swimming). 


Year    Time (seconds) 
1912    82.2 
1924    72.4 
1932    66.8 
1952    66.8 
1960    61.2 
1968    60.0 
1976    55.65 
1984    55.92 
1992    54.64 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. 

f. Find the estimated gold medal time for 1932. Find the estimated 
   time for 1984. 

g. Why are the answers from (f) different from the chart values? 

h. Use the two points in (f) to plot the least squares line on your 
   graph from (b). 

i. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

j. Use the least squares line to estimate the gold medal time for the 
   next Summer Olympics. Do you think that your answer is 
   reasonable? Why or why not? 


The next three questions use the following state information. 


State name        # letters    Year entered    Rank for entering    Area 
                  in name      the Union       the Union            (square miles) 
Alabama           7            1819            22                   52,423 
Colorado          8            1876            38                   104,100 
Hawaii            6            1959            50                   10,932 
Iowa              4            1846            29                   56,276 
Maryland          8            1788            7                    12,407 
Missouri          8            1821            24                   69,709 
New Jersey        9            1787            3                    8,722 
Ohio              4            1803            17                   44,828 
South Carolina    13           1788            8                    32,008 
Utah              4            1896            45                   84,904 
Wisconsin         9            1848            30                   65,499 


Exercise: 


Problem: 


We are interested in whether or not the number of letters in a state 
name depends upon the year the state entered the Union. 


a. Decide which variable should be the independent variable and 
   which should be the dependent variable. 

b. Make a scatter plot of the data. 

c. Does it appear from inspection that there is a relationship 
   between the variables? Why or why not? 

d. Calculate the least squares line. Put the equation in the form of: 
   ŷ = a + bx 

e. Find the correlation coefficient. What does it imply about the 
   significance of the relationship? 

f. Find the estimated number of letters (to the nearest integer) a 
   state would have if it entered the Union in 1900. Find the 
   estimated number of letters a state would have if it entered the 
   Union in 1940. 

g. Use the two points in (f) to plot the least squares line on your 
   graph from (b). 

h. Does it appear that a line is the best way to fit the data? Why or 
   why not? 

i. Use the least squares line to estimate the number of letters a new 
   state that enters the Union this year would have. Can the least 
   squares line be used to predict it? Why or why not? 


Solution: 


c. No 

d. ŷ = 47.03 - 0.0216x 

e. -0.4280 

f. 6; 5 
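

The slope in part (d) is small, so a quick numerical check is useful. The 
sketch below is illustrative only: the letter counts are obtained by 
counting the letters in each state name (ignoring the space in two-word 
names), and the years are those listed in the state table. The names 
`states` and `estimate` are assumptions made for this example. 

```python
import math

# State name -> year it entered the Union, from the state table.
states = {
    "Alabama": 1819, "Colorado": 1876, "Hawaii": 1959, "Iowa": 1846,
    "Maryland": 1788, "Missouri": 1821, "New Jersey": 1787, "Ohio": 1803,
    "South Carolina": 1788, "Utah": 1896, "Wisconsin": 1848,
}

years = list(states.values())
letters = [len(name.replace(" ", "")) for name in states]
n = len(years)

# Least-squares line y-hat = a + b*x and correlation coefficient r.
x_bar = sum(years) / n
y_bar = sum(letters) / n
s_xx = sum((x - x_bar) ** 2 for x in years)
s_yy = sum((y - y_bar) ** 2 for y in letters)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, letters))
b = s_xy / s_xx
a = y_bar - b * x_bar
r = s_xy / math.sqrt(s_xx * s_yy)

def estimate(year):
    """Estimated number of letters, rounded to the nearest integer."""
    return round(a + b * year)

print(f"y-hat = {a:.2f} + ({b:.4f})x, r = {r:.4f}")
print(estimate(1900), estimate(1940))
```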


Exercise: 
Problem: 


We are interested in whether there is a relationship between the 
ranking of a state and the area of the state. 


a. Let rank be the independent variable and area be the dependent
variable.


b. What do you think the scatter plot will look like? Make a scatter
plot of the data.

c. Does it appear from inspection that there is a relationship
between the variables? Why or why not?

d. Calculate the least squares line. Put the equation in the form of:
ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the
significance of the relationship?

f. Find the estimated areas for Alabama and for Colorado. Are they
close to the actual areas?

g. Use the two points in (f) to plot the least squares line on your
graph from (b).

h. Does it appear that a line is the best way to fit the data? Why or
why not?

i. Are there any outliers?

j. Use the least squares line to estimate the area of a new state that
enters the Union. Can the least squares line be used to predict it?
Why or why not?

k. Delete “Hawaii” and substitute “Alaska” for it. Alaska is the
forty-ninth state with an area of 656,424 square miles.

l. Calculate the new least squares line.

m. Find the estimated area for Alabama. Is it closer to the actual
area with this new least squares line or with the previous one that
included Hawaii? Why do you think that’s the case?

n. Do you think that, in general, newer states are larger than the
original states?


Exercise: 


Problem: 


We are interested in whether there is a relationship between the rank of 
a state and the year it entered the Union. 


a. Let year be the independent variable and rank be the dependent
variable.

b. What do you think the scatter plot will look like? Make a scatter
plot of the data.


c. Why must the relationship be positive between the variables?

d. Calculate the least squares line. Put the equation in the form of:
ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the
significance of the relationship?

f. Let’s say a fifty-first state entered the Union. Based upon the
least squares line, when should that have occurred?

g. Using the least squares line, how many states do we currently
have?

h. Why isn’t the least squares line a good estimator for this year?


Solution: 


d. ŷ = -480.5845 + 0.2748x
e. 0.9553
f. 1934
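
Part (f) asks for the year at which the line reaches rank 51, so the equation is solved for x instead of being evaluated at x. A minimal Python sketch, using the coefficients from the solution to (d); the function name year_for_rank is illustrative only.

```python
def year_for_rank(rank, a=-480.5845, b=0.2748):
    """Solve rank = a + b * year for year."""
    return (rank - a) / b


print(round(year_for_rank(51)))  # prints 1934
```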


Exercise: 
Problem: 


Below are the percentages of the U.S. labor force (excluding
self-employed and unemployed) that are members of a union. We are
interested in whether the decrease is significant. (Source: Bureau of
Labor Statistics, U.S. Dept. of Labor)


Year    Percent
1945    35.5
1950    31.5
1960    31.4
1970    27.3
1980    21.9
1986    17.5
1993    15.8


a. Let year be the independent variable and percent be the
dependent variable.

b. What do you think the scatter plot will look like? Make a scatter
plot of the data.

c. Why will the relationship between the variables be negative?

d. Calculate the least squares line. Put the equation in the form of:
ŷ = a + bx

e. Find the correlation coefficient. What does it imply about the
significance of the relationship?

f. Based on your answer to (e), do you think that the relationship
can be said to be decreasing?

g. If the trend continues, when will there no longer be any union
members? Do you think that will happen?


The next two questions refer to the following information: The data 
below reflects the 1991-92 Reunion Class Giving. (Source: SUNY Albany 
alumni magazine) 


Class Year    Average Gift ($)    Total Giving ($)
1922          41.67               125
1927          60.75               1,215
1932          83.82               [illegible]
1937          87.84               5,710
1947          88.27               6,003
1952          76.14               5,204
1957          52.29               4,393
1962          57.80               4,451
1972          42.68               18,093
1976          49.39               22,473
1981          46.87               20,997
1986          37.03               12,590
Exercise: 
Problem: 


We will use the columns “class year” and “total giving” for all 
questions, unless otherwise stated. 


a. What do you think the scatter plot will look like? Make a scatter
plot of the data.

b. Calculate the least squares line. Put the equation in the form of:
ŷ = a + bx


c. Find the correlation coefficient. What does it imply about the
significance of the relationship?

d. For the class of 1930, predict the total class gift.

e. For the class of 1964, predict the total class gift.

f. For the class of 1850, predict the total class gift. Why doesn’t
this value make any sense?


Solution: 


b. ŷ = -569,770.2796 + 296.0351x
c. 0.8302

d. $1,577.48

e. $11,642.68

f. -$22,105.33
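
The predictions in (d) through (f) can be reproduced by evaluating the line from (b) at each class year. A minimal Python sketch follows; the function name predicted_total is illustrative only, and because the printed coefficients are rounded, the results may differ from the listed answers by a few cents.

```python
def predicted_total(class_year, a=-569770.2796, b=296.0351):
    """Evaluate the least squares line y-hat = a + b * class_year."""
    return a + b * class_year


for class_year in (1930, 1964, 1850):
    print(class_year, round(predicted_total(class_year), 2))
```

The class years in the table run from 1922 to 1986. The year 1850 is far outside that range, and the line returns a negative total there, which is not a possible gift amount.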


Exercise: 


Problem: 


We will use the columns “class year” and “average gift” for all 
questions, unless otherwise stated. 


a. What do you think the scatter plot will look like? Make a scatter
plot of the data.

b. Calculate the least squares line. Put the equation in the form of:
ŷ = a + bx

c. Find the correlation coefficient. What does it imply about the
significance of the relationship?

d. For the class of 1930, predict the average class gift.

e. For the class of 1964, predict the average class gift.

f. For the class of 2010, predict the average class gift. Why doesn’t
this value make any sense?
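
One way to approach (b) through (f) without a calculator's regression function is to compute the slope, intercept, and correlation coefficient directly from their definitions. The Python sketch below does this for the “class year” and “average gift” columns of the table. The helper name least_squares and the variable names are illustrative, not part of the text.

```python
from math import sqrt

# Class year and average gift, copied from the Reunion Class Giving table.
years = [1922, 1927, 1932, 1937, 1947, 1952,
         1957, 1962, 1972, 1976, 1981, 1986]
avg_gift = [41.67, 60.75, 83.82, 87.84, 88.27, 76.14,
            52.29, 57.80, 42.68, 49.39, 46.87, 37.03]


def least_squares(xs, ys):
    """Return (a, b, r) for the line y-hat = a + b*x and the correlation r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept
    r = sxy / sqrt(sxx * syy)   # correlation coefficient
    return a, b, r


a, b, r = least_squares(years, avg_gift)
print(f"y-hat = {a:.4f} + ({b:.4f})x, r = {r:.4f}")

# Predictions for parts (d), (e), and (f).
for year in (1930, 1964, 2010):
    print(year, round(a + b * year, 2))
```

The class of 2010 lies outside the range of class years in the table, so that prediction is an extrapolation.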


Try these multiple choice questions 


Exercise: 


Problem: 


A correlation coefficient of -0.95 means there is a ____________
between the two variables.


A. Strong positive correlation
B. Weak negative correlation
C. Strong negative correlation
D. No correlation


Solution: 


C 
Exercise: 


Problem: 


According to the data reported by the New York State Department of
Health regarding West Nile Virus for the years 2000-2004, the least
squares line equation for the number of reported dead birds (x) versus
the number of human West Nile virus cases (y) is

ŷ = -10.2638 + 0.0491x. If the number of dead birds reported in a
year is 732, how many human cases of West Nile virus can be
expected?


A. 25.7
B. 46.2
C. -25.7
D. 7513


Solution: 


A 
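
The answer follows from substituting x = 732 into the least squares line. A minimal Python sketch, assuming the slope reads 0.0491 (the reading that yields choice A); the function name expected_cases is illustrative only.

```python
def expected_cases(dead_birds, a=-10.2638, b=0.0491):
    """Evaluate the least squares line y-hat = a + b * dead_birds."""
    return a + b * dead_birds


print(round(expected_cases(732), 1))  # prints 25.7
```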


The next three questions refer to the following data (showing the
number of hurricanes by category to directly strike the mainland U.S. each
decade), obtained from www.nhc.noaa.gov/gifs/table6.gif. A major hurricane
is one with a strength rating of 3, 4, or 5.


Decade    Total Number of Hurricanes    Number of Major Hurricanes
[table rows not recoverable from this copy]


Exercise: 


Problem: 


Using only completed decades (1941-2000), calculate the least
squares line for the number of major hurricanes expected based upon
the total number of hurricanes.


A. ŷ = -1.67x + 0.5
B. ŷ = 0.5x - 1.67
C. ŷ = 0.94x - 1.67
D. ŷ = -2x + 1


Solution: 


B
Exercise: 


Problem: 


The data for 2001-2004 show 9 hurricanes have hit the mainland 
United States. The line of best fit predicts 2.83 major hurricanes to hit 
mainland U.S. Can the least squares line be used to make this 
prediction? 


A. No, because 9 lies outside the independent variable values

B. Yes, because, in fact, there have been 3 major hurricanes this
decade

C. No, because 2.83 lies outside the dependent variable values

D. Yes, because how else could we predict what is going to happen
this decade.


Solution: 


A 
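
The value 2.83 quoted in this problem also gives a consistency check on the preceding exercise, since the line of best fit must return 2.83 when x = 9. A minimal Python sketch that evaluates the four candidate lines from that exercise at x = 9; the dictionary name candidates and the tolerance of 0.005 are illustrative choices.

```python
# The four answer choices from the preceding exercise, as functions of x
# (total number of hurricanes).
candidates = {
    "A": lambda x: -1.67 * x + 0.5,
    "B": lambda x: 0.5 * x - 1.67,
    "C": lambda x: 0.94 * x - 1.67,
    "D": lambda x: -2 * x + 1,
}

# Keep only the choices that reproduce the quoted prediction of 2.83 at x = 9.
matches = [label for label, line in candidates.items()
           if abs(line(9) - 2.83) < 0.005]
print(matches)  # prints ['B']
```

Reproducing the number does not make the prediction valid. That depends on whether 9 falls within the range of totals used to fit the line.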


